In [1]:
# Data processing
import numpy as np
import pandas as pd

# Stats
seed = np.random.SeedSequence(42)
rng = np.random.default_rng(seed=seed)
import pystan
import scipy.stats as st
from statsmodels.distributions.empirical_distribution import ECDF

# Data viz
import bokeh.io
import bokeh.palettes
import bokeh.plotting
bokeh.io.output_notebook()

# Helper functions
from stan_workup import summarize, predictive_regression
Loading BokehJS ...

Overview

This notebook contains code used to generate estimates for enzymatic parameters for Gordon Rix et al., 2020.

Workflow

Initial velocity measurements

The goal is to estimate $k_\text{cat}$ for each enzyme and $K_\text{M}$ for indole. Reaction mixtures of indole were prepared via serial dilution (where possible) to obtain the following concentration values:

concs = [500, 400, 300, 200, 100, 50, 25, 12.5, 6.25, 3.125, 1.5625, 0] # µM

with a constant concentration of 40 mM serine in 50 mM potassium phosphate buffer, pH 8.0, at 30 °C.

Enzyme was added (50 nM) and the absorbance at 290 nm was immediately recorded to observe the initial velocity of the reaction. For very low indole concentrations the velocities could be seen to level off over time (especially when >50 nM enzyme was used) as all of the indole is consumed, so only the first minute (really 10–70 seconds) was kept to determine the initial velocity. Although the values can be noisy, this is accounted for during the Bayesian inference process by propagating the measurement error through the whole analysis.

The absorbance difference between indole and tryptophan is known at 290 nm, allowing units of AU/time to be converted to mM/time, providing a specific rate. Since the enzyme concentration is also known, we can convert each velocity measurement to an enzyme-normalized rate.

Bayesian modeling

Generative distributions in Stan

Stan uses a generative modeling strategy to model the system and data and provide inferences. In other words, if one knows a reasonable model of the data-generating process, Stan will use a probabilistic approach to provide estimates of each parameter in the system. In this case, the data-generating process is as follows: at each indole concentration, a rate is sampled from the theoretical Michaelis-Menten curve, with some error (assumed to be normal). If the Michaelis-Menten equation is of the form:

\begin{align} k = \frac{k_\text{cat}\text{[indole]}}{K_\text{M} + \text{[indole]}}, \end{align}

then its generative distribution can be written as:

\begin{align} k \sim \text{Norm}\left(\frac{k_\text{cat}\text{[indole]}}{K_\text{M} + \text{[indole]}}, \sigma_k\right), \end{align}

where $\sigma_k$ is the standard deviation for the Michaelis-Menten generative distribution.

Rates are not measured perfectly, however, so each measured rate has its own data-generating process, which depends on hardware limitations (i.e., the lamp and detector on the UV-Vis spectrophotometer) and other factors. The rates are measured as the conversion of indole to tryptophan over time (via absorbance) and should be linear, so each is simply a linear regression of the absorbance values over time, with deviations from the theoretical slope following some normal distribution. (An alternative approach would be to model this as a stochastic differential equation.) As an additional factor, the noise of a spectrophotometer is typically heteroscedastic for large changes in absorbance, so each rate measurement is assigned its own standard deviation (the noise should be greater for higher-concentration samples than lower). For measuring absorbance over time (which can be converted to true rates as mentioned above), the data-generating process is as follows:

\begin{align} a_t = a \times t + a_0, \end{align}

where $a$ is the rate of absorbance change, $t$ is the time (in seconds), and $a_0$ is absorbance at $t=0$, and each rate is modeled as

\begin{align} a \sim \text{Norm}(m, \sigma_a), \end{align}

where $m$ is the estimated mean slope and $\sigma_a$ is the sample-specific standard deviation from the mean slope.

Priors

Bayesian inference requires "priors" for the estimated parameters ($k_\text{cat}$ and $K_\text{M}$), which should be broad and non-specific, but should reasonably limit the starting search space for the Markov chain Monte Carlo (MCMC) sampler (more on this below). In other words, a prior places initial probability density over reasonable values for these parameters, according to your previous knowledge. As long as your posterior distribution for each parameter is much more constrained than your prior, it is unlikely that your prior choices are strongly influencing your results. For $k_\text{cat}$ and $K_\text{M}$ we choose broad lognormal priors, which are useful as they constrain the values to be >0 (which they must be) and, in practice, these parameters vary in log-space rather than linear-space. That is, rates and rate constants vary multiplicatively as the energy of the system varies additively, so a lognormal prior better approximates what we know about these parameters. Finally, as Stan's sampler is very efficient, as long as our model is not exceptionally complex or pathological, a reasonable posterior distribution will be found as long as the prior probability is not identically zero. (Note that this is somewhat of an oversimplification.)
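As a quick numpy-only sanity check on what "broad" means here, we can draw from a lognormal matching the $k_\text{cat}$ prior used in the Stan model below, lognormal(log(150), 2.5). (The specific draw count and seed are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(42)

# Stan's lognormal(mu, sigma) means log(x) ~ Normal(mu, sigma),
# so lognormal(log(150), 2.5) has median exp(log(150)) = 150
draws = rng.lognormal(mean=np.log(150), sigma=2.5, size=100_000)

print(np.median(draws))                   # close to 150
print(np.percentile(draws, [2.5, 97.5]))  # spans several orders of magnitude
```

(The equivalent scipy distribution is st.lognorm(s=2.5, scale=150); note that scipy's scale parameter is exp(mu), not mu.)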

Sampling

The sampling procedure picks initial parameter guesses from your prior distributions and tests how well your data agree with those parameter guesses within the scope of your model. This is stored as a sample, assuming we are out of a warm-up stage. Then new parameter guesses are made and the sampler sees how well those agree. If they are better, they are accepted and stored. If they are worse, they are accepted and stored with some probability that depends on how reasonable the guess is. This is the basis of Markov chain Monte Carlo (MCMC) sampling, and it provides an estimate of your posterior distribution, which is a quantification of your uncertainty in your parameter estimates. These samples are ideally drawn independently and so are not correlated (which can be checked in the summary and diagnostics of the samples).
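Stan's sampler is NUTS, a Hamiltonian Monte Carlo variant, but the accept/reject logic described above can be sketched with a minimal random-walk Metropolis sampler for a single parameter (everything here, including the standard-normal target, step size, and chain length, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    """Unnormalized log-density of a standard normal target."""
    return -0.5 * x**2

x = 0.0
samples = []
for _ in range(20_000):
    # Propose a new guess near the current one
    proposal = x + rng.normal(0, 1)
    # Always accept better guesses; accept worse ones with
    # probability exp(log_target(proposal) - log_target(x))
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal
    samples.append(x)

samples = np.array(samples[5_000:])  # discard warm-up
print(samples.mean(), samples.std())  # should be near 0 and 1
```

Stan's Hamiltonian dynamics make far better proposals than this random walk, which is why its samples end up nearly independent.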

With the samples, you can compute summary statistics about your parameters, such as mean, median, and credible regions, and see how predictive your parameters are for the data. This is shown in the examples below.

A toy problem

To confirm that the model returns values that we expect, we can provide generated data (according to our data-generating process) and send it through Stan. The values we output should be reasonably close to the input values.

In [2]:
# Define initial parameters
k_cat = 1 # per sec
K_M = 10 # µM

sigma_k = 0.1

# Concentrations in duplicate
# Needs to be sorted and have the zeros as the first entries
concs = np.array(sorted([500, 400, 300, 200, 100, 50, 25, 12.5, 6.25, 3.125, 1.5625, 0]*2)) # µM

# Set up function to generate data
def MM(k_cat, K_M, conc, sigma=0):
    """Theoretical Michaelis-Menten equation, using
    absolute parameters with units of µM, seconds.
    """
    k = (k_cat*conc)/(K_M + conc)
    
    noisy_k = rng.normal(k, sigma)
    
    return noisy_k


# Generate data
ks = MM(k_cat, K_M, concs, sigma_k)

# Check
ks
Out[2]:
array([ 0.03047171, -0.10399841,  0.21018025,  0.22919161,  0.04299172,
        0.10787729,  0.39739942,  0.35299113,  0.55387544,  0.47025116,
        0.80222551,  0.79206491,  0.8399364 ,  0.94605745,  0.95584184,
        0.82316166,  0.98925603,  0.85649269,  1.05558697,  0.96274934,
        0.95712352,  0.9075168 ,  1.10264629,  0.96493921])

Looks okay, let's plot it and see.

In [3]:
theor_x = np.linspace(0, 500, 1000)
theor_k = MM(k_cat, K_M, theor_x, sigma=0)

p = bokeh.plotting.figure(plot_width=500, plot_height=400)
p.xaxis.axis_label = '[indole] (µM)'
p.yaxis.axis_label = 'k (per sec)'

p.line(theor_x, theor_k, color='black', line_width=2)
p.circle(concs, ks, size=8, fill_alpha=0.5)

bokeh.io.show(p)

Looks good. We'll now take these rates and convert them to the values we actually get: change in absorbance over time.

In [4]:
# Unit conversions
c_Enz = 50 # nM
epsilon = 1.89 # AU/mM; converts µM/s to mAU/s

# Convert rate to specific rate
vs = ks*(c_Enz/1000) # µM per sec

# Convert µM per sec to mAU per sec
a_rates = vs*epsilon # mAU per sec

# Check
a_rates
Out[4]:
array([ 0.00287958, -0.00982785,  0.01986203,  0.02165861,  0.00406272,
        0.0101944 ,  0.03755425,  0.03335766,  0.05234123,  0.04443873,
        0.07581031,  0.07485013,  0.07937399,  0.08940243,  0.09032705,
        0.07778878,  0.09348469,  0.08093856,  0.09975297,  0.09097981,
        0.09044817,  0.08576034,  0.10420007,  0.09118676])

With these we can now generate noisy absorbance values over time. We'll demonstrate for one of them first, and then do it with all of them.

In [5]:
# Take a 500 µM case
working_rate = a_rates[-1]

# Generate some times; 0.2 second increments
times = np.arange(0, 60, 0.2)

# Noise; use as a function of a_rate
sigma_a_factor = 5

def a_t(a_rate, time, a_0=0, sigma_factor=0):
    """Creates a noisy timecourse based on a slope and
    times, and an optional y-intercept and noise factor.
    """
    a_t = a_rate*time + a_0
    
    noisy_a_t = rng.normal(a_t, abs(sigma_factor*a_rate)+0.05)
    
    return noisy_a_t


# Get a timecourse
a_ts = a_t(working_rate, times, sigma_factor=sigma_a_factor)

# Plot
p = bokeh.plotting.figure(plot_width=500, plot_height=400)
p.xaxis.axis_label = 'time (sec)'
p.yaxis.axis_label = 'mAU'

p.line(times, a_ts, line_width=1)

bokeh.io.show(p)

Very reasonable. Now let's generate a timecourse for each rate and plot them all.

In [6]:
# Set up full milliabsorbance matrix, ordered by concs
mabs = np.array([
    a_t(working_rate, times, sigma_factor=sigma_a_factor)
    for working_rate in a_rates
])

p = bokeh.plotting.figure(plot_width=800, plot_height=400)
p.xaxis.axis_label = 'time (s)'
p.yaxis.axis_label = 'mAU'

# Set up color dict
colors = bokeh.palettes.magma(len(concs))[::2]
color_dict = {
    conc: color for conc, color in zip(np.unique(concs), colors)
}

for i, conc in enumerate(concs):
    mab = mabs[i]
    
    p.line(times, mab, color=color_dict[conc], legend_label=str(conc))

bokeh.io.show(p)

Now we can send these values through the Stan sampler and see if we get reasonable estimates of our original parameters ($k_\text{cat}$ = 1, $K_\text{M}$ = 10).

Let's first compile and display the model, which is written according to what was specified in 'Generative Distributions in Stan'.

In [7]:
# Create the model and display it
model = pystan.StanModel('MM_model.stan')

print(model)
INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_9e3ec6aa343e7a351ee4db039acd0f93 NOW.
StanModel object 'anon_model_9e3ec6aa343e7a351ee4db039acd0f93' coded as follows:
functions{

  // Theoretical MM enzyme equation
  vector MM(real k_cat, real K_M, vector conc){
    return  k_cat * (conc ./ (K_M + conc));
  }

  // Observed rate, units of mAU per second
  vector a_t(real rate, vector t, real a0){
    return rate * t + a0;
  }
  
}


data{

  // Total number of data points
  int N;

  // Number of rate measurements
  int M;
  
  // Number of background rate (no substrate) measurements
  // assumed to be the first 'M0' rates in the array
  int M0;
    
  // The data
  vector[N] t;
  matrix[M,N] a;
  
  // Known constants
  vector[M] conc; // uM
  real epsilon; // AU/mM
  real c_Enz; // nM
  
  // Scaling factor
  int scaling_factor;
  
  // For PPCs
  int max_conc;
  vector[max_conc+1] conc_ppc;
    
}


parameters{

  real<lower=0> V_max;

  real<lower=0> K_M;

  real<lower=0> sigma_k;
  vector<lower=0>[M] sigma_a;

  vector[M] rate;
  vector[M] a0;
  real background;

}


transformed parameters{
    
  // In units of uM/s
  vector[M] v0 = (rate/scaling_factor)/epsilon;
  
  // In units of /s
  vector[M] k = v0/(c_Enz/1000);
  real k_cat = V_max/(c_Enz/1000);

}


model{
  
  // Priors
  k_cat ~ lognormal(log(150), 2.5); // per second
  K_M ~ lognormal(log(500), 1.5); // uM
  
  background ~ std_normal();
  a0 ~ std_normal();

  sigma_k ~ std_normal();
  sigma_a ~ std_normal();

  // Likelihood
  for (m in 1:M0){
    a[m] ~ normal(a_t(background, t, a0[m]), sigma_a[m]);
  }
  
  for (m in 1:M){
    a[m] ~ normal(a_t(rate[m] + background, t, a0[m]), sigma_a[m]);
  }
  
  k ~ normal(MM(k_cat, K_M, conc), sigma_k);
  
}


generated quantities{
  
  real k_ppc[max_conc+1];
  
  // Draw posterior predictive data set
  k_ppc = normal_rng(MM(k_cat, K_M, conc_ppc), sigma_k);

}

Now we can prepare the data and do some sampling.

In [8]:
# Multiply absorbances by 100 to give rates of ~1
# This speeds up modeling, and is later factored in
scaling_factor = 100
scaled_absorbances = mabs*scaling_factor

# Store as dictionary
data = dict(
    N=len(times),
    M=len(concs),
    M0=sum(concs == 0),
    t=times,
    conc=concs,
    a=scaled_absorbances,
    scaling_factor=scaling_factor,
    epsilon=epsilon,
    c_Enz=c_Enz,
    max_conc=int(max(concs)),
    conc_ppc=np.arange(0, int(max(concs)) + 1),
)

# Sample
samples = model.sampling(data)

# Check out a summary
samples
WARNING:pystan:Truncated summary with the 'fit.__repr__' method. For the full summary use 'print(fit)'
Out[8]:
Warning: Shown data is truncated to 100 parameters
For the full summary use 'print(fit)'

Inference for Stan model: anon_model_9e3ec6aa343e7a351ee4db039acd0f93.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

              mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
V_max         0.05  2.3e-5 1.7e-3   0.05   0.05   0.05   0.05   0.05   5257    1.0
K_M          11.47    0.03   1.86   8.31  10.17  11.33  12.54  15.64   5244    1.0
sigma_k       0.09  2.1e-4   0.01   0.07   0.08   0.09    0.1   0.12   5068    1.0
sigma_a[1]    7.72  7.6e-3   0.29   7.16   7.52   7.71   7.91   8.31   1445    1.0
sigma_a[2]   16.35    0.01   0.47  15.42  16.04  16.34  16.66  17.29   1644    1.0
sigma_a[3]   11.58  3.7e-3   0.34  10.96  11.35  11.58  11.81  12.26   8457    1.0
sigma_a[4]   13.49  3.8e-3   0.37  12.79  13.24  13.49  13.73  14.22   9319    1.0
sigma_a[5]    6.47  2.6e-3   0.23   6.03   6.31   6.46   6.62   6.93   7882    1.0
sigma_a[6]    9.16  3.1e-3    0.3   8.59   8.96   9.15   9.35   9.77   9543    1.0
sigma_a[7]   16.33  4.2e-3    0.4  15.57  16.06  16.32   16.6  17.14   8976    1.0
sigma_a[8]   15.75  4.2e-3   0.38  15.03  15.49  15.75  16.02  16.51   8503    1.0
sigma_a[9]   20.33  4.5e-3   0.44   19.5  20.03  20.33  20.62  21.22   9645    1.0
sigma_a[10]  18.41  4.6e-3   0.41  17.62  18.12   18.4  18.68  19.24   7943    1.0
sigma_a[11]  24.74  4.6e-3   0.44  23.91  24.44  24.74  25.02  25.61   9035    1.0
sigma_a[12]  24.56  5.3e-3   0.47  23.64  24.24  24.55  24.87  25.48   7689    1.0
sigma_a[13]   25.2  4.5e-3   0.44  24.34   24.9  25.21   25.5  26.07   9636    1.0
sigma_a[14]  26.46  4.4e-3   0.44  25.61  26.15  26.45  26.75  27.34   9928    1.0
sigma_a[15]  27.83  5.1e-3   0.47  26.96  27.52  27.82  28.13  28.78   8462    1.0
sigma_a[16]  25.94  4.6e-3   0.45  25.04  25.65  25.94  26.23  26.86   9457    1.0
sigma_a[17]  28.33  4.8e-3   0.46  27.47  28.01  28.31  28.65  29.21   8946    1.0
sigma_a[18]  25.22  5.0e-3   0.47   24.3  24.89  25.21  25.54  26.14   9028    1.0
sigma_a[19]  28.91  4.8e-3   0.45  28.05   28.6   28.9  29.21   29.8   8704    1.0
sigma_a[20]  26.07  5.0e-3   0.46  25.17  25.76  26.07  26.38  26.99   8496    1.0
sigma_a[21]  26.89  4.1e-3   0.45  26.02  26.58  26.88  27.19  27.79  12020    1.0
sigma_a[22]  26.29  4.8e-3   0.44  25.47  25.99  26.29  26.59  27.13   8235    1.0
sigma_a[23]   29.4  5.0e-3   0.46   28.5  29.07  29.39  29.71  30.31   8405    1.0
sigma_a[24]  25.84  4.6e-3   0.45  24.97  25.55  25.84  26.13  26.72   9564    1.0
rate[1]       0.14  5.3e-4   0.02    0.1   0.13   0.14   0.16   0.18   1603    1.0
rate[2]      -0.63  8.6e-4   0.04  -0.71  -0.66  -0.64  -0.61  -0.56   2089    1.0
rate[3]       2.05  1.1e-3   0.04   1.98   2.03   2.05   2.08   2.13   1302    1.0
rate[4]       2.22  1.0e-3   0.04   2.14   2.19   2.22   2.24    2.3   1508    1.0
rate[5]       0.43  9.4e-4   0.03   0.37   0.41   0.43   0.45   0.49   1129    1.0
rate[6]       1.07  9.8e-4   0.03    1.0   1.05   1.07   1.09   1.14   1268    1.0
rate[7]        3.7  9.8e-4   0.04   3.62   3.68    3.7   3.73   3.79   1869    1.0
rate[8]       3.38  1.1e-3   0.04   3.29   3.35   3.38   3.41   3.46   1601    1.0
rate[9]       5.23  1.1e-3   0.05   5.13    5.2   5.23   5.26   5.33   2042    1.0
rate[10]      4.54  1.0e-3   0.05   4.45    4.5   4.53   4.57   4.62   1963    1.0
rate[11]      7.69  1.1e-3   0.05   7.58   7.65   7.69   7.72   7.79   2496    1.0
rate[12]      7.41  1.1e-3   0.05   7.31   7.38   7.41   7.45   7.51   2456    1.0
rate[13]      7.93  1.1e-3   0.06   7.83   7.89   7.93   7.96   8.04   2379    1.0
rate[14]      8.97  1.2e-3   0.06   8.86   8.93   8.97   9.01   9.08   2273    1.0
rate[15]      9.16  1.1e-3   0.06   9.05   9.12   9.16    9.2   9.27   2574    1.0
rate[16]       7.8  1.2e-3   0.06   7.69   7.76    7.8   7.83    7.9   2231    1.0
rate[17]      9.44  1.1e-3   0.06   9.32    9.4   9.44   9.48   9.56   2661    1.0
rate[18]      8.02  1.1e-3   0.06   7.91   7.98   8.02   8.06   8.13   2428    1.0
rate[19]      9.95  1.2e-3   0.06   9.83   9.91   9.95   9.99  10.07   2670    1.0
rate[20]       9.1  1.2e-3   0.05    9.0   9.06    9.1   9.14   9.21   2044    1.0
rate[21]      9.03  1.2e-3   0.06   8.92   8.99   9.03   9.07   9.14   2322    1.0
rate[22]      8.52  1.1e-3   0.06   8.41   8.48   8.52   8.56   8.63   2506    1.0
rate[23]     10.48  1.2e-3   0.06  10.36  10.44  10.48  10.52   10.6   2646    1.0
rate[24]      9.19  1.1e-3   0.05   9.08   9.15   9.19   9.23    9.3   2283    1.0
a0[1]         6.74    0.02   0.61   5.58   6.31   6.72   7.15   7.96   1249    1.0
a0[2]       -13.99  9.8e-3   0.72  -15.4 -14.48  -14.0  -13.5  -12.6   5395    1.0
a0[3]        -1.05  9.7e-3    0.8   -2.6   -1.6  -1.05  -0.51   0.48   6684    1.0
a0[4]        -0.04  9.9e-3   0.85   -1.7  -0.63  -0.04   0.54   1.59   7359    1.0
a0[5]         0.67  6.7e-3   0.61   -0.5   0.25   0.68   1.07   1.85   8275    1.0
a0[6]         0.28  9.0e-3   0.72  -1.15  -0.21   0.27   0.76   1.72   6481    1.0
a0[7]         0.72 10.0e-3   0.84  -0.92   0.16   0.71   1.28    2.4   7075    1.0
a0[8]         1.21    0.01   0.88  -0.56   0.62   1.23   1.81   2.94   6335    1.0
a0[9]        -0.24    0.01   0.93  -2.13  -0.87  -0.24   0.38   1.58   7702    1.0
a0[10]       -0.27  9.9e-3   0.89  -2.04  -0.87  -0.25   0.32   1.47   8118    1.0
a0[11]       -0.82    0.01   0.92  -2.66  -1.44  -0.83  -0.19   0.97   7236    1.0
a0[12]       -0.62    0.01   0.93  -2.46  -1.25  -0.62 1.0e-3   1.21   6916    1.0
a0[13]       -0.29    0.01   0.93  -2.13  -0.93  -0.27   0.33   1.54   6645    1.0
a0[14]        1.28    0.01   0.94   -0.6   0.67    1.3   1.91   3.13   7359    1.0
a0[15]       -0.04    0.01   0.94  -1.87  -0.66  -0.03   0.61   1.78   6673    1.0
a0[16]        1.04    0.01   0.94  -0.81   0.43   1.04   1.68   2.92   7944    1.0
a0[17]        0.11    0.01   0.96  -1.77  -0.55   0.12   0.76   1.97   7197    1.0
a0[18]        1.15    0.01   0.97  -0.81   0.51   1.17    1.8   3.05   7069    1.0
a0[19]        0.41    0.01   0.95  -1.46  -0.22   0.42   1.04   2.26   8162    1.0
a0[20]        1.31    0.01   0.92  -0.53   0.68   1.31   1.92   3.13   6774    1.0
a0[21]        -0.5    0.01   0.94  -2.34  -1.13   -0.5   0.15   1.31   6500    1.0
a0[22]       -0.86    0.01   0.96  -2.77  -1.49  -0.86  -0.24   1.01   8113    1.0
a0[23]        0.34    0.01   0.95  -1.52  -0.31   0.34   0.97   2.24   7406    1.0
a0[24]        0.34    0.01   0.95  -1.51  -0.29   0.34    1.0   2.19   6560    1.0
background   -0.04  9.0e-4   0.03  -0.09  -0.05  -0.04  -0.02   0.01    831   1.01
v0[1]       7.5e-4  2.8e-6 1.1e-4 5.3e-4 6.8e-4 7.5e-4 8.3e-4 9.7e-4   1603    1.0
v0[2]      -3.4e-3  4.6e-6 2.1e-4-3.8e-3-3.5e-3-3.4e-3-3.2e-3-3.0e-3   2089    1.0
v0[3]         0.01  5.6e-6 2.0e-4   0.01   0.01   0.01   0.01   0.01   1302    1.0
v0[4]         0.01  5.5e-6 2.2e-4   0.01   0.01   0.01   0.01   0.01   1508    1.0
v0[5]       2.3e-3  5.0e-6 1.7e-4 1.9e-3 2.2e-3 2.3e-3 2.4e-3 2.6e-3   1129    1.0
v0[6]       5.7e-3  5.2e-6 1.8e-4 5.3e-3 5.5e-3 5.7e-3 5.8e-3 6.0e-3   1268    1.0
v0[7]         0.02  5.2e-6 2.2e-4   0.02   0.02   0.02   0.02   0.02   1869    1.0
v0[8]         0.02  5.8e-6 2.3e-4   0.02   0.02   0.02   0.02   0.02   1601    1.0
v0[9]         0.03  5.7e-6 2.6e-4   0.03   0.03   0.03   0.03   0.03   2042    1.0
v0[10]        0.02  5.5e-6 2.4e-4   0.02   0.02   0.02   0.02   0.02   1963    1.0
v0[11]        0.04  5.6e-6 2.8e-4   0.04   0.04   0.04   0.04   0.04   2496    1.0
v0[12]        0.04  5.6e-6 2.8e-4   0.04   0.04   0.04   0.04   0.04   2456    1.0
v0[13]        0.04  6.0e-6 2.9e-4   0.04   0.04   0.04   0.04   0.04   2379    1.0
v0[14]        0.05  6.2e-6 2.9e-4   0.05   0.05   0.05   0.05   0.05   2273    1.0
v0[15]        0.05  6.0e-6 3.0e-4   0.05   0.05   0.05   0.05   0.05   2574    1.0
v0[16]        0.04  6.2e-6 2.9e-4   0.04   0.04   0.04   0.04   0.04   2231    1.0
v0[17]        0.05  6.1e-6 3.1e-4   0.05   0.05   0.05   0.05   0.05   2661    1.0
v0[18]        0.04  6.0e-6 3.0e-4   0.04   0.04   0.04   0.04   0.04   2428    1.0
v0[19]        0.05  6.1e-6 3.2e-4   0.05   0.05   0.05   0.05   0.05   2670    1.0
v0[20]        0.05  6.4e-6 2.9e-4   0.05   0.05   0.05   0.05   0.05   2044    1.0
v0[21]        0.05  6.2e-6 3.0e-4   0.05   0.05   0.05   0.05   0.05   2322    1.0
v0[22]        0.05  6.0e-6 3.0e-4   0.04   0.04   0.05   0.05   0.05   2506    1.0
v0[23]        0.06  6.3e-6 3.2e-4   0.05   0.06   0.06   0.06   0.06   2646    1.0
lp__        -3.9e4    0.15   5.94 -3.9e4 -3.9e4 -3.9e4 -3.9e4 -3.9e4   1477    1.0

Samples were drawn using NUTS at Fri May 22 12:00:10 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
In [9]:
# Convert to a dataframe
df_stan = samples.to_dataframe()

# Check the results
summarize(df_stan['k_cat'], units='per sec')
Out[9]:
value median 95% CR
0 k_cat (per sec) 1.027 [0.963, 1.096]
In [10]:
summarize(df_stan['K_M'], units='µM')
Out[10]:
value median 95% CR
0 K_M (µM) 11.325 [8.31, 15.634]

The analysis looks good. We notice a few things:

The priors were lognormal (and absolutely enormous) to start. The posterior for $k_\text{cat}$ now looks normal, which means we have enough data to overwhelm the prior and have our samples be distributed more akin to the model (which assumes a normal distribution for $k_\text{cat}$). On the other hand, $K_\text{M}$ is still a bit lognormal, as evidenced by the tail toward higher values. This results in a larger upper bound for $K_\text{M}$ (the distance from the 50th to the 97.5th percentile is larger than that from the 2.5th to the 50th), but the true value still lies within the 95% credible region. The posterior for $K_\text{M}$ has still been greatly informed by the data, however, and hardly resembles the original prior at all. (See below.)

We can see how our samples for the rates at each indole concentration compare to the original data we created.

In [11]:
# Get each rate sample
rate_samples = [col for col in df_stan.columns if 'rate' in col]
df_rates = df_stan[rate_samples]

# Tidy the data
df_rates = df_rates.melt(var_name='sample', value_name='estimated scaled rate')

# Add in concentration info for each sample
df_rates['[indole] (µM)'] = df_rates['sample'].map({sample: conc for sample, conc in zip(rate_samples, concs)})

# Convert rate to k
df_rates['estimated rate (per sec)'] = (
    df_rates['estimated scaled rate'] / scaling_factor / epsilon / (c_Enz/1000)
)

# Set easier indexing
df_rates.set_index('sample', inplace=True)

# Check
df_rates
Out[11]:
estimated scaled rate [indole] (µM) estimated rate (per sec)
sample
rate[1] 0.152237 0.0 0.016110
rate[1] 0.140157 0.0 0.014831
rate[1] 0.177864 0.0 0.018822
rate[1] 0.161084 0.0 0.017046
rate[1] 0.156364 0.0 0.016546
... ... ... ...
rate[24] 9.095454 500.0 0.962482
rate[24] 9.235292 500.0 0.977280
rate[24] 9.133743 500.0 0.966534
rate[24] 9.283638 500.0 0.982396
rate[24] 9.162300 500.0 0.969556

96000 rows × 3 columns

In [12]:
theor_x = np.linspace(0, 500, 1000)
theor_k = MM(k_cat, K_M, theor_x, sigma=0)

p = bokeh.plotting.figure(plot_width=500, plot_height=400)
p.xaxis.axis_label = '[indole] (µM)'
p.yaxis.axis_label = 'k (per sec)'

p.line(theor_x, theor_k, color='black', line_width=2, legend_label='theoretical MM curve')
p.circle(concs, ks, size=8, fill_alpha=0.5, legend_label='actual sampled rates')
    
# Plot the 4000 actual sample values (effectively error bars)
p.square(
    df_rates['[indole] (µM)'],
    df_rates['estimated rate (per sec)'],
    size=2,
    fill_color='black',
    line_color=None,
    fill_alpha=1,
)

# Plot the median values
for sample in df_rates.index.unique():
    _df = df_rates.loc[sample]
    p.circle(
        _df['[indole] (µM)'].median(),
        _df['estimated rate (per sec)'].median(),
        size=6,
        fill_color='orange',
        line_color='black',
        fill_alpha=0.8,
        legend_label='median rate estimates',
    )
    
p.legend.location = 'bottom_right'

bokeh.io.show(p)

Here we see that the points are often a little off. This comes from a part of the data-generating process in the Stan model that was not used to generate the data: background correction. In the full model, it is assumed that the background rate of change for the samples without any indole is the same in all cases (e.g., some sort of background drift), so this value is subtracted from all rate estimates. We can see that this isn't the case for the generated data (since we know how we generated it and that the estimates don't align), but only because we have the true values in hand. This is impossible to know in a real experiment; thus, we model what we expect to happen in our real experiment as best we can.

Finally, we can use our median estimates for $k_\text{cat}$ and $K_\text{M}$ to plot our median Michaelis-Menten curve estimation (what our theoretical Michaelis-Menten behavior would look like assuming the median estimates were exactly correct).

In [13]:
# Add median estimation
estimated_ks = MM(df_stan['k_cat'].median(), df_stan['K_M'].median(), theor_x, sigma=0)
p.line(theor_x, estimated_ks, color='#004D00', line_width=3, legend_label='median estimated MM curve')

bokeh.io.show(p)

Posterior predictive checks

Part of the model was to use the samples to generate posterior predictive checks, which give an idea of how predictive the estimated parameters are for the given data. This is best shown for the theoretical Michaelis-Menten curve in the form of shaded credible regions, which we'll have as 95% (widest region), 75%, 50%, and 25% (narrowest region), plus the median line (which should align well with the median estimated MM curve above). We'll overlay all the previous data as well, except the median estimate curve.
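predictive_regression is a helper from stan_workup; under the hood, shaded credible regions like these boil down to percentiles of the posterior predictive samples at each concentration. A minimal sketch of that computation (with fake draws standing in for the k_ppc samples; this is not the helper's actual code):

```python
import numpy as np

# Pretend posterior predictive draws: rows are posterior samples,
# columns are concentrations (fake data standing in for k_ppc)
rng = np.random.default_rng(1)
ppc_draws = rng.normal(loc=1.0, scale=0.1, size=(4000, 501))

# A central credible region at each concentration: the 95% region
# spans the 2.5th to 97.5th percentile of the draws
lower, median, upper = np.percentile(ppc_draws, [2.5, 50, 97.5], axis=0)

print(lower.shape, median.shape, upper.shape)  # each (501,)
```

The lower/upper arrays at several percentile pairs are what get drawn as the nested shaded bands.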

In [14]:
p = predictive_regression(df_stan, 'k_ppc')

p.line(theor_x, theor_k, color='black', line_width=2, legend_label='theoretical MM curve')
p.circle(concs, ks, size=8, fill_alpha=0.5, legend_label='actual sampled rates')
    
# Plot the 4000 actual sample values (effectively error bars)
p.square(
    df_rates['[indole] (µM)'],
    df_rates['estimated rate (per sec)'],
    size=2,
    fill_color='black',
    line_color=None,
    fill_alpha=1,
)

# Plot the median values
for sample in df_rates.index.unique():
    _df = df_rates.loc[sample]
    p.circle(
        _df['[indole] (µM)'].median(),
        _df['estimated rate (per sec)'].median(),
        size=6,
        fill_color='orange',
        line_color='black',
        fill_alpha=0.8,
        legend_label='median rate estimates',
    )
    
p.legend.location = 'bottom_right'

bokeh.io.show(p)

(As a note, the jaggedness of the credible regions can be smoothed out by taking more than 4000 samples, but this will take longer and take up more space in memory, or on disk if the samples are saved.)

Visualizing what we learned

We'll quickly look at how much our data informed the priors.

In [15]:
xs = np.linspace(0, 2, 1000)

test = st.lognorm(s=2.5, loc=0, scale=150).cdf(xs)  # scipy's scale = exp(mu) = 150

p = bokeh.plotting.figure(plot_width=500, plot_height=400)
p.xaxis.axis_label = 'k_cat'
p.yaxis.axis_label = '(E)CDF'

ecdf = ECDF(df_stan['k_cat'].values)
p.circle(ecdf.x, ecdf.y, legend_label='posterior ECDF')
p.line(xs, test, color='black', line_width=3, legend_label='prior CDF')

bokeh.io.show(p)
In [16]:
xs = np.linspace(0, 50, 1000)

test = st.lognorm(s=1.5, loc=0, scale=500).cdf(xs)  # scipy's scale = exp(mu) = 500

p = bokeh.plotting.figure(plot_width=500, plot_height=400)
p.xaxis.axis_label = 'K_M'
p.yaxis.axis_label = '(E)CDF'

ecdf = ECDF(df_stan['K_M'].values)
p.circle(ecdf.x, ecdf.y, legend_label='posterior ECDF')
p.line(xs, test, color='black', line_width=3, legend_label='prior CDF')

bokeh.io.show(p)

Real data

Now that we're convinced that we can recover parameters with this model (and assuming we trust the model as a reasonable and useful description of our system), we'll use some actual enzyme kinetics data, as was done to estimate parameters for the paper.

Data workup

First we'll deal with cleaning up the raw absorbance data.

In [17]:
df = pd.read_excel('example_kinetics_data.xlsx')

df.head()
Out[17]:
Time ( Second ) 50nM_GR260_0uM_indole_1 - RawData 50nM_GR260_0uM_indole_2 - RawData 50nM_GR260_1.5625uM_indole_1 - RawData 50nM_GR260_1.5625uM_indole_2 - RawData 50nM_GR260_3.125uM_indole_1 - RawData 50nM_GR260_3.125uM_indole_2 - RawData 50nM_GR260_6.25uM_indole_1 - RawData 50nM_GR260_6.25uM_indole_2 - RawData 50nM_GR260_12.5uM_indole_1 - RawData ... 50nM_GR260_200uM_indole_1 - RawData 50nM_GR260_200uM_indole_2 - RawData 50nM_GR260_300uM_indole_1 - RawData 50nM_GR260_300uM_indole_2 - RawData 50nM_GR260_400uM_indole_1 - RawData 50nM_GR260_400uM_indole_2 - RawData 50nM_GR260_400uM_indole_3 - RawData 50nM_GR260_500uM_indole_1 - RawData 50nM_GR260_500uM_indole_2 - RawData blank - RawData
0 0.0 0.0042 0.0039 0.0067 0.0066 0.0119 0.0122 0.0168 0.0169 0.0283 ... 0.4052 0.4121 0.6051 0.6120 0.8062 0.8059 0.8074 0.9969 1.0027 0.0001
1 0.2 0.0044 0.0038 0.0067 0.0068 0.0120 0.0121 0.0168 0.0170 0.0283 ... 0.4052 0.4122 0.6051 0.6122 0.8060 0.8068 0.8076 0.9969 1.0028 0.0001
2 0.4 0.0042 0.0039 0.0068 0.0068 0.0120 0.0123 0.0168 0.0171 0.0283 ... 0.4052 0.4121 0.6050 0.6118 0.8060 0.8064 0.8078 0.9973 1.0027 0.0001
3 0.6 0.0043 0.0038 0.0067 0.0065 0.0120 0.0123 0.0168 0.0170 0.0282 ... 0.4052 0.4123 0.6048 0.6116 0.8060 0.8072 0.8075 0.9969 1.0026 0.0000
4 0.8 0.0043 0.0039 0.0068 0.0066 0.0121 0.0121 0.0168 0.0170 0.0282 ... 0.4052 0.4123 0.6050 0.6115 0.8062 0.8072 0.8073 0.9964 1.0028 0.0000

5 rows × 27 columns

Clean up the column names.

(Note: multiple columns with the same name in a pandas DataFrame are not recommended; we only use them momentarily here to prepare numpy arrays.)

In [18]:
# Instantiate list
new_cols = [None for _ in df.columns]

for i, column in enumerate(df.columns):
    
    # Standardize time label
    if column == 'Time ( Second )':
        new_cols[i] = 'Time (s)'
        
    # Clean up experimental names
    else:
        working_string = column.replace(' - RawData', '')
        split_string = working_string.split('_')

        if len(split_string) == 1:
            working_string = split_string[0]
        else:
            working_string = split_string[2].replace('uM', '')
            
        new_cols[i] = working_string

# Rename the columns
df.columns = new_cols

# Check
df.head()
Out[18]:
Time (s) 0 0 1.5625 1.5625 3.125 3.125 6.25 6.25 12.5 ... 200 200 300 300 400 400 400 500 500 blank
0 0.0 0.0042 0.0039 0.0067 0.0066 0.0119 0.0122 0.0168 0.0169 0.0283 ... 0.4052 0.4121 0.6051 0.6120 0.8062 0.8059 0.8074 0.9969 1.0027 0.0001
1 0.2 0.0044 0.0038 0.0067 0.0068 0.0120 0.0121 0.0168 0.0170 0.0283 ... 0.4052 0.4122 0.6051 0.6122 0.8060 0.8068 0.8076 0.9969 1.0028 0.0001
2 0.4 0.0042 0.0039 0.0068 0.0068 0.0120 0.0123 0.0168 0.0171 0.0283 ... 0.4052 0.4121 0.6050 0.6118 0.8060 0.8064 0.8078 0.9973 1.0027 0.0001
3 0.6 0.0043 0.0038 0.0067 0.0065 0.0120 0.0123 0.0168 0.0170 0.0282 ... 0.4052 0.4123 0.6048 0.6116 0.8060 0.8072 0.8075 0.9969 1.0026 0.0000
4 0.8 0.0043 0.0039 0.0068 0.0066 0.0121 0.0121 0.0168 0.0170 0.0282 ... 0.4052 0.4123 0.6050 0.6115 0.8062 0.8072 0.8073 0.9964 1.0028 0.0000

5 rows × 27 columns

In [19]:
# t0 subtract and transpose to simplify operations
t0_sub_values = (df.values - df.values[0]).T

# Ignore Time (column 0) and blank (column -1) columns
full_absorbances = t0_sub_values[1:-1]

# Convert AU to mAU
absorbances = full_absorbances*1000

# Check
absorbances
Out[19]:
array([[ 0. ,  0.2,  0. , ..., -0.4, -0.5, -0.4],
       [ 0. , -0.1,  0. , ..., -0.1, -0.1, -0.2],
       [ 0. ,  0. ,  0.1, ...,  1.4,  1.4,  1.4],
       ...,
       [ 0. ,  0.2,  0.4, ...,  6.5,  6.9,  6.6],
       [ 0. ,  0. ,  0.4, ...,  6.8,  7.2,  6.4],
       [ 0. ,  0.1,  0. , ...,  7.7,  8.5,  8.4]])
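The `df.values - df.values[0]` step above relies on NumPy broadcasting to subtract the first row (t = 0) from every row; a toy sketch of the same operation:

```python
import numpy as np

# Subtract the first row from every row via broadcasting,
# mirroring the t0 subtraction above (toy values)
x = np.array([[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]])
t0_sub = x - x[0]
print(t0_sub)
```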
In [20]:
# Plot to check
concs = df.columns[1:-1].astype(float)
times = df['Time (s)'].values

p = bokeh.plotting.figure(plot_width=800, plot_height=400)
p.xaxis.axis_label = 'Time (s)'
p.yaxis.axis_label = 'mAU'

palette = bokeh.palettes.magma(len(absorbances) + 2)
for i, absorbance in enumerate(absorbances):
    p.line(times, absorbance, color=palette[i], legend_label=str(concs[i]))

bokeh.io.show(p)

Here we see that the real data look remarkably like the data generated earlier (with a few more real-world inconsistencies), suggesting that our data-generating process (and our overall model) is reasonable. We can proceed.

Shorten the timecourse to 1 minute, from 10 seconds to 70 seconds.

In [21]:
df_min = df[(df['Time (s)'] >= 10) & (df['Time (s)'] <= 70)].copy()

df_min.head()
Out[21]:
Time (s) 0 0 1.5625 1.5625 3.125 3.125 6.25 6.25 12.5 ... 200 200 300 300 400 400 400 500 500 blank
50 10.0 0.0042 0.0039 0.0071 0.0067 0.0125 0.0121 0.0172 0.0174 0.0291 ... 0.4058 0.4130 0.6055 0.6110 0.8069 0.8082 0.8074 0.9979 1.0040 0.0000
51 10.2 0.0043 0.0039 0.0071 0.0067 0.0124 0.0122 0.0171 0.0173 0.0288 ... 0.4060 0.4131 0.6056 0.6108 0.8071 0.8076 0.8074 0.9980 1.0044 0.0000
52 10.4 0.0043 0.0040 0.0070 0.0069 0.0125 0.0122 0.0172 0.0173 0.0290 ... 0.4058 0.4131 0.6055 0.6110 0.8070 0.8081 0.8073 0.9984 1.0035 -0.0001
53 10.6 0.0043 0.0039 0.0071 0.0070 0.0125 0.0122 0.0171 0.0176 0.0289 ... 0.4059 0.4131 0.6056 0.6112 0.8066 0.8086 0.8076 0.9977 1.0037 -0.0002
54 10.8 0.0043 0.0041 0.0071 0.0068 0.0126 0.0121 0.0172 0.0174 0.0288 ... 0.4060 0.4131 0.6055 0.6111 0.8069 0.8085 0.8072 0.9983 1.0037 -0.0001

5 rows × 27 columns

In [22]:
# Prepare values
absorbances = (df_min.values - df_min.values[0]).T[1:-1]*1000
concs = df_min.columns[1:-1].astype(float)
times = df_min['Time (s)'].values

# Plot
p = bokeh.plotting.figure(plot_width=800, plot_height=400)
p.xaxis.axis_label = 'Time (s)'
p.yaxis.axis_label = 'mAU'

palette = bokeh.palettes.magma(len(concs) + 2)
for i, absorbance in enumerate(absorbances):
    p.line(times, absorbance, color=palette[i], legend_label=str(concs[i]))

bokeh.io.show(p)

Model the system
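Before fitting, recall the generative picture from the overview: at each indole concentration, a rate is drawn from the theoretical Michaelis-Menten curve with Normal noise. A minimal NumPy sketch of that process (the parameter values here are illustrative, not the inferred ones):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative generative parameters (not the posterior estimates)
V_max, K_M, sigma = 6.5, 5.7, 0.3

# Draw noisy rates from the Michaelis-Menten curve at each concentration
conc = np.array([0.0, 1.5625, 3.125, 6.25, 12.5, 25.0, 50.0, 100.0])
theoretical = V_max * conc / (K_M + conc)
observed = rng.normal(theoretical, sigma)
```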

In [23]:
# Multiply absorbances by 100 to give rates of ~1
# This speeds up modeling, and is later factored in
scaling_factor = 100
scaled_absorbances = absorbances*scaling_factor

# Enzyme concentration, in nM
c_Enz = 50

# Store as dictionary
data = dict(
    N=len(times),
    M=len(concs),
    M0=sum(concs == 0),
    t=times,
    conc=concs,
    a=scaled_absorbances,
    scaling_factor=scaling_factor,
    epsilon=1.89,
    c_Enz=c_Enz,
    max_conc=int(max(concs)),
    conc_ppc=range(0, int(max(concs)+1)),
)

# Sample; use more iterations and warm-up for better results
# This should take ~1 minute on a CPU
samples = model.sampling(data)

# Check out a summary
samples
WARNING:pystan:Truncated summary with the 'fit.__repr__' method. For the full summary use 'print(fit)'
Out[23]:
Warning: Shown data is truncated to 100 parameters
For the full summary use 'print(fit)'

Inference for Stan model: anon_model_9e3ec6aa343e7a351ee4db039acd0f93.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

              mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
V_max         0.03  1.1e-5 8.6e-4   0.03   0.03   0.03   0.03   0.03   5740    1.0
K_M           5.67    0.01    1.0   3.94    5.0   5.58   6.24   7.88   4957    1.0
sigma_k       0.05  1.3e-4 8.6e-3   0.04   0.05   0.05   0.06   0.07   4594    1.0
sigma_a[1]   10.85  4.4e-3   0.29   10.3  10.65  10.84  11.04  11.44   4337    1.0
sigma_a[2]     8.7  3.4e-3   0.24   8.23   8.53   8.69   8.85   9.19   5009    1.0
sigma_a[3]    8.38  3.9e-3   0.31    7.8   8.16   8.37   8.58   9.01   6194    1.0
sigma_a[4]    8.17  3.3e-3   0.28   7.63   7.97   8.17   8.36   8.73   7504    1.0
sigma_a[5]    9.96  3.8e-3   0.33   9.33   9.75   9.96  10.18  10.62   7476    1.0
sigma_a[6]    9.47  3.9e-3   0.32   8.87   9.24   9.46   9.67  10.11   6668    1.0
sigma_a[7]   12.61  5.4e-3    0.4  11.83  12.33   12.6  12.87  13.43   5563    1.0
sigma_a[8]   12.15  5.2e-3    0.4   11.4  11.87  12.14  12.43  12.97   6015    1.0
sigma_a[9]    19.2  5.4e-3   0.46  18.32  18.89   19.2  19.51  20.15   7301    1.0
sigma_a[10]  11.54  4.8e-3   0.38  10.81  11.28  11.53   11.8   12.3   6368    1.0
sigma_a[11]  17.04  5.6e-3   0.43  16.21  16.76  17.04  17.33  17.93   6094    1.0
sigma_a[12]   14.4  5.8e-3   0.42  13.56  14.11  14.39  14.67  15.24   5413    1.0
sigma_a[13]  13.03  5.1e-3    0.4   12.3  12.76  13.03  13.29  13.83   6030    1.0
sigma_a[14]  15.38  6.0e-3   0.44  14.56  15.08  15.38  15.68  16.27   5454    1.0
sigma_a[15]  15.98  5.2e-3   0.43  15.16  15.69  15.97  16.27  16.85   6991    1.0
sigma_a[16]  15.43  5.8e-3   0.44  14.58  15.13  15.42  15.73   16.3   5695    1.0
sigma_a[17]   22.5  6.2e-3   0.46   21.6  22.19   22.5   22.8  23.42   5579    1.0
sigma_a[18]  21.27  5.5e-3   0.45   20.4  20.97  21.24  21.57  22.18   6651    1.0
sigma_a[19]  16.82  5.4e-3   0.43  15.98  16.53  16.82  17.11   17.7   6356    1.0
sigma_a[20]  22.17  5.2e-3   0.47  21.25  21.85  22.16  22.49  23.09   7951    1.0
sigma_a[21]  24.49  5.1e-3   0.46  23.58  24.18  24.49  24.79   25.4   8073    1.0
sigma_a[22]  18.76  4.4e-3   0.41  17.96  18.49  18.75  19.03  19.59   8717    1.0
sigma_a[23]  24.12  5.0e-3   0.43  23.28  23.83  24.12   24.4  24.96   7241    1.0
sigma_a[24]  23.74  5.3e-3   0.45  22.87  23.43  23.74  24.05  24.65   7233    1.0
sigma_a[25]  25.65  5.3e-3   0.47  24.77  25.34  25.65  25.98  26.57   7787    1.0
rate[1]      -0.13  3.1e-4   0.02  -0.17  -0.14  -0.13  -0.11  -0.09   3966    1.0
rate[2]       0.08  2.5e-4   0.02   0.05   0.07   0.08   0.09   0.11   4058    1.0
rate[3]       1.43  5.2e-4   0.03   1.38   1.41   1.43   1.44   1.48   2452    1.0
rate[4]       2.03  5.2e-4   0.02   1.98   2.01   2.03   2.04   2.08   2215    1.0
rate[5]       1.95  5.5e-4   0.03    1.9   1.93   1.95   1.97    2.0   2419    1.0
rate[6]       2.91  5.5e-4   0.03   2.86    2.9   2.91   2.93   2.97   2453    1.0
rate[7]       3.56  6.0e-4   0.03    3.5   3.54   3.56   3.58   3.62   2656    1.0
rate[8]       3.42  5.6e-4   0.03   3.37    3.4   3.42   3.44   3.48   2877    1.0
rate[9]       3.29  6.2e-4   0.04   3.22   3.27   3.29   3.32   3.36   3458    1.0
rate[10]      3.74  5.9e-4   0.03   3.68   3.72   3.74   3.76    3.8   2504    1.0
rate[11]      4.45  5.7e-4   0.03   4.38   4.43   4.45   4.47   4.52   3564    1.0
rate[12]       4.5  6.3e-4   0.03   4.43   4.47    4.5   4.52   4.56   2651    1.0
rate[13]      5.53  5.8e-4   0.03   5.47   5.51   5.53   5.55   5.59   2810    1.0
rate[14]      4.77  6.3e-4   0.03    4.7   4.74   4.77   4.79   4.83   2853    1.0
rate[15]      4.73  6.1e-4   0.03   4.66    4.7   4.73   4.75   4.79   2976    1.0
rate[16]      5.54  6.7e-4   0.03   5.47   5.52   5.54   5.56   5.61   2656    1.0
rate[17]      5.82  6.8e-4   0.04   5.74    5.8   5.82   5.85    5.9   3562    1.0
rate[18]      6.23  6.1e-4   0.04   6.16   6.21   6.23   6.26   6.31   3884    1.0
rate[19]      5.91  6.1e-4   0.03   5.85   5.89   5.91   5.94   5.98   3149    1.0
rate[20]      6.38  6.1e-4   0.04    6.3   6.35   6.38    6.4   6.45   4166    1.0
rate[21]      6.05  6.8e-4   0.04   5.96   6.02   6.05   6.08   6.13   3870    1.0
rate[22]      5.43  6.3e-4   0.04   5.36   5.41   5.43   5.45    5.5   3165    1.0
rate[23]      6.77  6.6e-4   0.04   6.69   6.74   6.77    6.8   6.85   3720    1.0
rate[24]      5.47  7.0e-4   0.04   5.39   5.44   5.47    5.5   5.55   3453    1.0
rate[25]      5.94  6.9e-4   0.04   5.86   5.91   5.94   5.97   6.03   3759    1.0
a0[1]        -2.84    0.01   0.63  -4.03  -3.28  -2.82   -2.4  -1.62   2528    1.0
a0[2]         9.86    0.01   0.59   8.68   9.46   9.85  10.26  11.04   2350    1.0
a0[3]        -6.32    0.01   0.81  -7.94  -6.88  -6.31  -5.77  -4.74   6174    1.0
a0[4]         -2.2  9.5e-3   0.76   -3.7  -2.72   -2.2  -1.68  -0.72   6524    1.0
a0[5]         -3.7    0.01   0.84  -5.31  -4.29  -3.71  -3.14  -2.05   5326    1.0
a0[6]        -6.03    0.01   0.85  -7.65  -6.59  -6.05  -5.44  -4.36   5881    1.0
a0[7]        -9.37    0.01    1.0 -11.34 -10.04  -9.38   -8.7   -7.4   4916    1.0
a0[8]        -9.27    0.01   0.97 -11.22  -9.92  -9.29  -8.61  -7.38   5554    1.0
a0[9]        -8.89    0.01    1.0 -10.86  -9.59  -8.87  -8.21  -6.94   5986    1.0
a0[10]       -8.48    0.01   0.95 -10.33  -9.13   -8.5  -7.85  -6.62   4688    1.0
a0[11]       -8.51    0.01   0.97 -10.47  -9.16  -8.51  -7.84  -6.64   5798    1.0
a0[12]        -9.6    0.01   1.01 -11.61 -10.27   -9.6  -8.91  -7.67   4856    1.0
a0[13]       -8.46    0.01   0.97 -10.33  -9.12  -8.45  -7.81  -6.54   6084    1.0
a0[14]       -9.27    0.01    1.0 -11.25  -9.94  -9.26  -8.58  -7.35   5245    1.0
a0[15]       -9.04    0.01    1.0  -11.0  -9.71  -9.06  -8.37  -7.01   5964    1.0
a0[16]       -9.14    0.01   1.04 -11.15  -9.84  -9.16  -8.44  -7.06   5371    1.0
a0[17]       -7.47    0.01   1.04  -9.49  -8.16  -7.47  -6.79  -5.39   6236    1.0
a0[18]       -8.24    0.01   0.96 -10.07  -8.89  -8.23   -7.6  -6.35   5749    1.0
a0[19]       -7.16    0.01    1.0   -9.1  -7.83  -7.18  -6.51  -5.15   5179    1.0
a0[20]       -7.15    0.01   1.01  -9.19  -7.82  -7.13  -6.48  -5.18   7737    1.0
a0[21]       -6.85    0.01   1.03  -8.89  -7.53  -6.84  -6.18  -4.84   5917    1.0
a0[22]       -2.56    0.01   0.94  -4.42   -3.2  -2.54  -1.92  -0.76   6801    1.0
a0[23]       -3.33    0.01   0.93  -5.16  -3.95  -3.33  -2.71  -1.51   6479    1.0
a0[24]       -4.28    0.01   0.98  -6.17  -4.96  -4.26  -3.59   -2.4   6628    1.0
a0[25]       -4.71    0.01   0.99  -6.62  -5.38  -4.71  -4.04  -2.79   6347    1.0
background   -0.34  4.2e-4   0.02  -0.37  -0.35  -0.34  -0.33  -0.31   1324    1.0
v0[1]      -6.7e-4  1.6e-6 1.0e-4-8.7e-4-7.3e-4-6.6e-4-6.0e-4-4.7e-4   3966    1.0
v0[2]       4.2e-4  1.3e-6 8.5e-5 2.6e-4 3.7e-4 4.2e-4 4.8e-4 5.9e-4   4058    1.0
v0[3]       7.5e-3  2.7e-6 1.4e-4 7.3e-3 7.5e-3 7.5e-3 7.6e-3 7.8e-3   2452    1.0
v0[4]         0.01  2.8e-6 1.3e-4   0.01   0.01   0.01   0.01   0.01   2215    1.0
v0[5]         0.01  2.9e-6 1.4e-4   0.01   0.01   0.01   0.01   0.01   2419    1.0
v0[6]         0.02  2.9e-6 1.4e-4   0.02   0.02   0.02   0.02   0.02   2453    1.0
v0[7]         0.02  3.2e-6 1.6e-4   0.02   0.02   0.02   0.02   0.02   2656    1.0
v0[8]         0.02  2.9e-6 1.6e-4   0.02   0.02   0.02   0.02   0.02   2877    1.0
v0[9]         0.02  3.3e-6 1.9e-4   0.02   0.02   0.02   0.02   0.02   3458    1.0
v0[10]        0.02  3.1e-6 1.6e-4   0.02   0.02   0.02   0.02   0.02   2504    1.0
v0[11]        0.02  3.0e-6 1.8e-4   0.02   0.02   0.02   0.02   0.02   3564    1.0
v0[12]        0.02  3.3e-6 1.7e-4   0.02   0.02   0.02   0.02   0.02   2651    1.0
v0[13]        0.03  3.1e-6 1.6e-4   0.03   0.03   0.03   0.03   0.03   2810    1.0
v0[14]        0.03  3.3e-6 1.8e-4   0.02   0.03   0.03   0.03   0.03   2853    1.0
v0[15]        0.03  3.2e-6 1.8e-4   0.02   0.02   0.03   0.03   0.03   2976    1.0
v0[16]        0.03  3.5e-6 1.8e-4   0.03   0.03   0.03   0.03   0.03   2656    1.0
v0[17]        0.03  3.6e-6 2.1e-4   0.03   0.03   0.03   0.03   0.03   3562    1.0
v0[18]        0.03  3.2e-6 2.0e-4   0.03   0.03   0.03   0.03   0.03   3884    1.0
v0[19]        0.03  3.2e-6 1.8e-4   0.03   0.03   0.03   0.03   0.03   3149    1.0
v0[20]        0.03  3.2e-6 2.1e-4   0.03   0.03   0.03   0.03   0.03   4166    1.0
lp__        -3.4e4    0.17   6.42 -3.4e4 -3.4e4 -3.4e4 -3.4e4 -3.4e4   1421    1.0

Samples were drawn using NUTS at Fri May 22 12:01:27 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
In [24]:
# Convert to a dataframe
df_stan = samples.to_dataframe()

# Check
df_stan
Out[24]:
chain draw warmup V_max K_M sigma_k sigma_a[1] sigma_a[2] sigma_a[3] sigma_a[4] ... k_ppc[499] k_ppc[500] k_ppc[501] lp__ accept_stat__ stepsize__ treedepth__ n_leapfrog__ divergent__ energy__
0 0 0 0 0.032405 5.357302 0.070881 11.280008 8.671507 8.811607 7.774412 ... 0.599899 0.689708 0.658650 -33632.854557 0.962765 0.221389 4 15 0 33668.655415
1 0 1 0 0.030951 7.232506 0.051152 10.618796 8.229086 7.872670 8.639071 ... 0.579037 0.709276 0.552583 -33633.728229 0.945289 0.221389 4 15 0 33667.919956
2 0 2 0 0.030199 5.713886 0.055922 11.226554 9.013489 8.848639 7.824554 ... 0.650568 0.579673 0.637679 -33631.914565 0.944350 0.221389 5 47 0 33671.157231
3 0 3 0 0.029036 3.972844 0.054436 11.055111 8.628888 8.443295 8.035728 ... 0.508982 0.574513 0.626114 -33634.253768 0.989332 0.221389 4 15 0 33668.246936
4 0 4 0 0.031613 5.293299 0.055740 10.732535 8.192769 8.586796 7.399953 ... 0.585135 0.596188 0.567067 -33641.860064 0.959428 0.221389 4 15 0 33675.896305
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3995 3 995 0 0.030336 5.350727 0.065329 10.447966 8.644256 7.687465 7.667499 ... 0.563212 0.636558 0.474744 -33647.163319 0.889179 0.195267 4 15 0 33678.747304
3996 3 996 0 0.032326 5.465688 0.069696 10.783132 8.691680 8.263555 7.992564 ... 0.572372 0.677749 0.715661 -33639.509174 0.982053 0.195267 5 31 0 33681.342284
3997 3 997 0 0.031597 4.234075 0.053463 10.979560 8.555790 8.333706 8.348188 ... 0.570749 0.584545 0.608496 -33633.150851 0.963340 0.195267 5 31 0 33683.079669
3998 3 998 0 0.032377 4.913828 0.055110 10.560956 8.830692 8.386619 7.951147 ... 0.581586 0.683295 0.744881 -33639.762969 0.998394 0.195267 5 31 0 33674.945102
3999 3 999 0 0.031503 4.951312 0.054396 11.049655 8.550517 8.242909 8.328842 ... 0.602536 0.605697 0.652281 -33637.459120 0.710577 0.195267 5 31 0 33674.862651

4000 rows × 641 columns
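The sampler's diagnostics (`divergent__`, `treedepth__`, etc., visible in the rightmost columns above) ride along in this dataframe, so convergence health can be checked programmatically. A quick sketch, using a synthetic stand-in for `df_stan` since the real one requires the fit:

```python
import pandas as pd

# Synthetic stand-in carrying the NUTS diagnostic columns from Out[24]
df_diag = pd.DataFrame({'divergent__': [0, 0, 0, 0],
                        'treedepth__': [4, 5, 4, 5]})

# No divergences and modest tree depths indicate a healthy run
print('divergences:', df_diag['divergent__'].sum())
print('max tree depth:', df_diag['treedepth__'].max())
```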

Efficient storage, if needed.

In [25]:
# # Write out a ~20 MB file
# df_stan.to_parquet('df_stan.parquet', compression='gzip')

# # Read in the file
# df_stan = pd.read_parquet('df_stan.parquet')

# # Check again
# df_stan

Data analysis

In [26]:
# Get summary stats
summarize(df_stan['k_cat'], units='per sec')
Out[26]:
value median 95% CR
0 k_cat (per sec) 0.625 [0.591, 0.659]
In [27]:
summarize(df_stan['K_M'], units='µM')
Out[27]:
value median 95% CR
0 K_M (µM) 5.581 [3.944, 7.876]
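The `summarize` helper from `stan_workup` reports the posterior median and a central 95% credible region; the equivalent computation with NumPy alone (synthetic samples shown in place of `df_stan['K_M']`) might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic posterior samples standing in for df_stan['K_M']
samples = rng.normal(5.6, 1.0, size=4000)

# Median and central 95% credible region
median = np.median(samples)
low, high = np.percentile(samples, [2.5, 97.5])
print(f'median = {median:.2f}, 95% CR = [{low:.2f}, {high:.2f}]')
```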
In [28]:
# Get each rate sample
rate_samples = [col for col in df_stan.columns if 'rate' in col]
df_rates = df_stan[rate_samples]

# Tidy the data
df_rates = df_rates.melt(var_name='sample', value_name='estimated scaled rate')

# Add in concentration info for each sample
df_rates['[indole] (µM)'] = df_rates['sample'].map(dict(zip(rate_samples, concs)))

# Convert scaled rate to an enzyme-normalized rate (per sec): undo the
# scaling, convert mAU/s to µM/s via epsilon, then normalize by the
# enzyme concentration (50 nM = 0.05 µM)
epsilon = data['epsilon']
df_rates['estimated rate (per sec)'] = (
    df_rates['estimated scaled rate'] / scaling_factor / epsilon / (c_Enz / 1000)
)

# Set easier indexing
df_rates.set_index('sample', inplace=True)

# Generate the predictive regression plot
p = predictive_regression(df_stan, 'k_ppc')
    
# Plot the 4000 actual sample values (effectively error bars)
p.square(
    df_rates['[indole] (µM)'],
    df_rates['estimated rate (per sec)'],
    size=2,
    fill_color='black',
    line_color=None,
    fill_alpha=1,
)

# Plot the median values
for sample in df_rates.index.unique():
    _df = df_rates.loc[sample]
    p.circle(
        _df['[indole] (µM)'].median(),
        _df['estimated rate (per sec)'].median(),
        size=6,
        fill_color='orange',
        line_color='black',
        fill_alpha=0.8,
        legend_label='median rate estimates',
    )
    
p.legend.location = 'bottom_right'

bokeh.io.show(p)
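As a sanity check on the unit conversion above, we can run a near-saturating scaled rate from the Stan summary (rate[25] ≈ 5.9 at 500 µM indole) through the same arithmetic by hand:

```python
# Sanity check of the scaled-rate -> per-sec conversion
scaled_rate = 5.9           # scaled mAU/s, from the Stan summary
scaling_factor = 100        # applied before modeling
epsilon = 1.89              # mAU per µM at 290 nm
c_Enz_uM = 50 / 1000        # 50 nM enzyme, in µM

k = scaled_rate / scaling_factor / epsilon / c_Enz_uM
print(f'{k:.3f} per sec')   # ~0.624, consistent with the k_cat median of 0.625
```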
In [ ]: